<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />
<meta name="generator" content="AsciiDoc 8.6.8" />
<title>Gene Structure Annotation and Analysis Using PASA</title>
<style type="text/css">
/* Shared CSS for AsciiDoc xhtml11 and html5 backends */

/* Default font. */
body {
  font-family: Georgia,serif;
}

/* Title font. */
h1, h2, h3, h4, h5, h6,
div.title, caption.title,
thead, p.table.header,
#toctitle,
#author, #revnumber, #revdate, #revremark,
#footer {
  font-family: Arial,Helvetica,sans-serif;
}

body {
  margin: 1em 5% 1em 5%;
}

a {
  color: blue;
  text-decoration: underline;
}
a:visited {
  color: fuchsia;
}

em {
  font-style: italic;
  color: navy;
}

strong {
  font-weight: bold;
  color: #083194;
}

h1, h2, h3, h4, h5, h6 {
  color: #527bbd;
  margin-top: 1.2em;
  margin-bottom: 0.5em;
  line-height: 1.3;
}

h1, h2, h3 {
  border-bottom: 2px solid silver;
}
h2 {
  padding-top: 0.5em;
}
h3 {
  float: left;
}
h3 + * {
  clear: left;
}
h5 {
  font-size: 1.0em;
}

div.sectionbody {
  margin-left: 0;
}

hr {
  border: 1px solid silver;
}

p {
  margin-top: 0.5em;
  margin-bottom: 0.5em;
}

ul, ol, li > p {
  margin-top: 0;
}
ul > li     { color: #aaa; }
ul > li > * { color: black; }

.monospaced, code, pre {
  font-family: "Courier New", Courier, monospace;
  font-size: inherit;
  color: navy;
  padding: 0;
  margin: 0;
}


#author {
  color: #527bbd;
  font-weight: bold;
  font-size: 1.1em;
}
#email {
}
#revnumber, #revdate, #revremark {
}

#footer {
  font-size: small;
  border-top: 2px solid silver;
  padding-top: 0.5em;
  margin-top: 4.0em;
}
#footer-text {
  float: left;
  padding-bottom: 0.5em;
}
#footer-badges {
  float: right;
  padding-bottom: 0.5em;
}

#preamble {
  margin-top: 1.5em;
  margin-bottom: 1.5em;
}
div.imageblock, div.exampleblock, div.verseblock,
div.quoteblock, div.literalblock, div.listingblock, div.sidebarblock,
div.admonitionblock {
  margin-top: 1.0em;
  margin-bottom: 1.5em;
}
div.admonitionblock {
  margin-top: 2.0em;
  margin-bottom: 2.0em;
  margin-right: 10%;
  color: #606060;
}

div.content { /* Block element content. */
  padding: 0;
}

/* Block element titles. */
div.title, caption.title {
  color: #527bbd;
  font-weight: bold;
  text-align: left;
  margin-top: 1.0em;
  margin-bottom: 0.5em;
}
div.title + * {
  margin-top: 0;
}

td div.title:first-child {
  margin-top: 0.0em;
}
div.content div.title:first-child {
  margin-top: 0.0em;
}
div.content + div.title {
  margin-top: 0.0em;
}

div.sidebarblock > div.content {
  background: #ffffee;
  border: 1px solid #dddddd;
  border-left: 4px solid #f0f0f0;
  padding: 0.5em;
}

div.listingblock > div.content {
  border: 1px solid #dddddd;
  border-left: 5px solid #f0f0f0;
  background: #f8f8f8;
  padding: 0.5em;
}

div.quoteblock, div.verseblock {
  padding-left: 1.0em;
  margin-left: 1.0em;
  margin-right: 10%;
  border-left: 5px solid #f0f0f0;
  color: #888;
}

div.quoteblock > div.attribution {
  padding-top: 0.5em;
  text-align: right;
}

div.verseblock > pre.content {
  font-family: inherit;
  font-size: inherit;
}
div.verseblock > div.attribution {
  padding-top: 0.75em;
  text-align: left;
}
/* DEPRECATED: Pre version 8.2.7 verse style literal block. */
div.verseblock + div.attribution {
  text-align: left;
}

div.admonitionblock .icon {
  vertical-align: top;
  font-size: 1.1em;
  font-weight: bold;
  text-decoration: underline;
  color: #527bbd;
  padding-right: 0.5em;
}
div.admonitionblock td.content {
  padding-left: 0.5em;
  border-left: 3px solid #dddddd;
}

div.exampleblock > div.content {
  border-left: 3px solid #dddddd;
  padding-left: 0.5em;
}

div.imageblock div.content { padding-left: 0; }
span.image img { border-style: none; }
a.image:visited { color: white; }

dl {
  margin-top: 0.8em;
  margin-bottom: 0.8em;
}
dt {
  margin-top: 0.5em;
  margin-bottom: 0;
  font-style: normal;
  color: navy;
}
dd > *:first-child {
  margin-top: 0.1em;
}

ul, ol {
    list-style-position: outside;
}
ol.arabic {
  list-style-type: decimal;
}
ol.loweralpha {
  list-style-type: lower-alpha;
}
ol.upperalpha {
  list-style-type: upper-alpha;
}
ol.lowerroman {
  list-style-type: lower-roman;
}
ol.upperroman {
  list-style-type: upper-roman;
}

div.compact ul, div.compact ol,
div.compact p, div.compact p,
div.compact div, div.compact div {
  margin-top: 0.1em;
  margin-bottom: 0.1em;
}

tfoot {
  font-weight: bold;
}
td > div.verse {
  white-space: pre;
}

div.hdlist {
  margin-top: 0.8em;
  margin-bottom: 0.8em;
}
div.hdlist tr {
  padding-bottom: 15px;
}
dt.hdlist1.strong, td.hdlist1.strong {
  font-weight: bold;
}
td.hdlist1 {
  vertical-align: top;
  font-style: normal;
  padding-right: 0.8em;
  color: navy;
}
td.hdlist2 {
  vertical-align: top;
}
div.hdlist.compact tr {
  margin: 0;
  padding-bottom: 0;
}

.comment {
  background: yellow;
}

.footnote, .footnoteref {
  font-size: 0.8em;
}

span.footnote, span.footnoteref {
  vertical-align: super;
}

#footnotes {
  margin: 20px 0 20px 0;
  padding: 7px 0 0 0;
}

#footnotes div.footnote {
  margin: 0 0 5px 0;
}

#footnotes hr {
  border: none;
  border-top: 1px solid silver;
  height: 1px;
  text-align: left;
  margin-left: 0;
  width: 20%;
  min-width: 100px;
}

div.colist td {
  padding-right: 0.5em;
  padding-bottom: 0.3em;
  vertical-align: top;
}
div.colist td img {
  margin-top: 0.3em;
}

@media print {
  #footer-badges { display: none; }
}

#toc {
  margin-bottom: 2.5em;
}

#toctitle {
  color: #527bbd;
  font-size: 1.1em;
  font-weight: bold;
  margin-top: 1.0em;
  margin-bottom: 0.1em;
}

div.toclevel0, div.toclevel1, div.toclevel2, div.toclevel3, div.toclevel4 {
  margin-top: 0;
  margin-bottom: 0;
}
div.toclevel2 {
  margin-left: 2em;
  font-size: 0.9em;
}
div.toclevel3 {
  margin-left: 4em;
  font-size: 0.9em;
}
div.toclevel4 {
  margin-left: 6em;
  font-size: 0.9em;
}

span.aqua { color: aqua; }
span.black { color: black; }
span.blue { color: blue; }
span.fuchsia { color: fuchsia; }
span.gray { color: gray; }
span.green { color: green; }
span.lime { color: lime; }
span.maroon { color: maroon; }
span.navy { color: navy; }
span.olive { color: olive; }
span.purple { color: purple; }
span.red { color: red; }
span.silver { color: silver; }
span.teal { color: teal; }
span.white { color: white; }
span.yellow { color: yellow; }

span.aqua-background { background: aqua; }
span.black-background { background: black; }
span.blue-background { background: blue; }
span.fuchsia-background { background: fuchsia; }
span.gray-background { background: gray; }
span.green-background { background: green; }
span.lime-background { background: lime; }
span.maroon-background { background: maroon; }
span.navy-background { background: navy; }
span.olive-background { background: olive; }
span.purple-background { background: purple; }
span.red-background { background: red; }
span.silver-background { background: silver; }
span.teal-background { background: teal; }
span.white-background { background: white; }
span.yellow-background { background: yellow; }

span.big { font-size: 2em; }
span.small { font-size: 0.6em; }

span.underline { text-decoration: underline; }
span.overline { text-decoration: overline; }
span.line-through { text-decoration: line-through; }

div.unbreakable { page-break-inside: avoid; }


/*
 * xhtml11 specific
 *
 * */

div.tableblock {
  margin-top: 1.0em;
  margin-bottom: 1.5em;
}
div.tableblock > table {
  border: 3px solid #527bbd;
}
thead, p.table.header {
  font-weight: bold;
  color: #527bbd;
}
p.table {
  margin-top: 0;
}
/* Because the table frame attribute is overriden by CSS in most browsers. */
div.tableblock > table[frame="void"] {
  border-style: none;
}
div.tableblock > table[frame="hsides"] {
  border-left-style: none;
  border-right-style: none;
}
div.tableblock > table[frame="vsides"] {
  border-top-style: none;
  border-bottom-style: none;
}


/*
 * html5 specific
 *
 * */

table.tableblock {
  margin-top: 1.0em;
  margin-bottom: 1.5em;
}
thead, p.tableblock.header {
  font-weight: bold;
  color: #527bbd;
}
p.tableblock {
  margin-top: 0;
}
table.tableblock {
  border-width: 3px;
  border-spacing: 0px;
  border-style: solid;
  border-color: #527bbd;
  border-collapse: collapse;
}
th.tableblock, td.tableblock {
  border-width: 1px;
  padding: 4px;
  border-style: solid;
  border-color: #527bbd;
}

table.tableblock.frame-topbot {
  border-left-style: hidden;
  border-right-style: hidden;
}
table.tableblock.frame-sides {
  border-top-style: hidden;
  border-bottom-style: hidden;
}
table.tableblock.frame-none {
  border-style: hidden;
}

th.tableblock.halign-left, td.tableblock.halign-left {
  text-align: left;
}
th.tableblock.halign-center, td.tableblock.halign-center {
  text-align: center;
}
th.tableblock.halign-right, td.tableblock.halign-right {
  text-align: right;
}

th.tableblock.valign-top, td.tableblock.valign-top {
  vertical-align: top;
}
th.tableblock.valign-middle, td.tableblock.valign-middle {
  vertical-align: middle;
}
th.tableblock.valign-bottom, td.tableblock.valign-bottom {
  vertical-align: bottom;
}


/*
 * manpage specific
 *
 * */

body.manpage h1 {
  padding-top: 0.5em;
  padding-bottom: 0.5em;
  border-top: 2px solid silver;
  border-bottom: 2px solid silver;
}
body.manpage h2 {
  border-style: none;
}
body.manpage div.sectionbody {
  margin-left: 3em;
}

@media print {
  body.manpage div#toc { display: none; }
}


</style>
<script type="text/javascript">
/*<![CDATA[*/
var asciidoc = {  // Namespace.

/////////////////////////////////////////////////////////////////////
// Table Of Contents generator
/////////////////////////////////////////////////////////////////////

/* Author: Mihai Bazon, September 2002
 * http://students.infoiasi.ro/~mishoo
 *
 * Table Of Content generator
 * Version: 0.4
 *
 * Feel free to use this script under the terms of the GNU General Public
 * License, as long as you do not remove or alter this notice.
 */

 /* modified by Troy D. Hanson, September 2006. License: GPL */
 /* modified by Stuart Rackham, 2006, 2009. License: GPL */

// toclevels = 1..4.
toc: function (toclevels) {

  function getText(el) {
    var text = "";
    for (var i = el.firstChild; i != null; i = i.nextSibling) {
      if (i.nodeType == 3 /* Node.TEXT_NODE */) // IE doesn't speak constants.
        text += i.data;
      else if (i.firstChild != null)
        text += getText(i);
    }
    return text;
  }

  function TocEntry(el, text, toclevel) {
    this.element = el;
    this.text = text;
    this.toclevel = toclevel;
  }

  function tocEntries(el, toclevels) {
    var result = new Array;
    var re = new RegExp('[hH]([1-'+(toclevels+1)+'])');
    // Function that scans the DOM tree for header elements (the DOM2
    // nodeIterator API would be a better technique but not supported by all
    // browsers).
    var iterate = function (el) {
      for (var i = el.firstChild; i != null; i = i.nextSibling) {
        if (i.nodeType == 1 /* Node.ELEMENT_NODE */) {
          var mo = re.exec(i.tagName);
          if (mo && (i.getAttribute("class") || i.getAttribute("className")) != "float") {
            result[result.length] = new TocEntry(i, getText(i), mo[1]-1);
          }
          iterate(i);
        }
      }
    }
    iterate(el);
    return result;
  }

  var toc = document.getElementById("toc");
  if (!toc) {
    return;
  }

  // Delete existing TOC entries in case we're reloading the TOC.
  var tocEntriesToRemove = [];
  var i;
  for (i = 0; i < toc.childNodes.length; i++) {
    var entry = toc.childNodes[i];
    if (entry.nodeName.toLowerCase() == 'div'
     && entry.getAttribute("class")
     && entry.getAttribute("class").match(/^toclevel/))
      tocEntriesToRemove.push(entry);
  }
  for (i = 0; i < tocEntriesToRemove.length; i++) {
    toc.removeChild(tocEntriesToRemove[i]);
  }

  // Rebuild TOC entries.
  var entries = tocEntries(document.getElementById("content"), toclevels);
  for (var i = 0; i < entries.length; ++i) {
    var entry = entries[i];
    if (entry.element.id == "")
      entry.element.id = "_toc_" + i;
    var a = document.createElement("a");
    a.href = "#" + entry.element.id;
    a.appendChild(document.createTextNode(entry.text));
    var div = document.createElement("div");
    div.appendChild(a);
    div.className = "toclevel" + entry.toclevel;
    toc.appendChild(div);
  }
  if (entries.length == 0)
    toc.parentNode.removeChild(toc);
},


/////////////////////////////////////////////////////////////////////
// Footnotes generator
/////////////////////////////////////////////////////////////////////

/* Based on footnote generation code from:
 * http://www.brandspankingnew.net/archive/2005/07/format_footnote.html
 */

footnotes: function () {
  // Delete existing footnote entries in case we're reloading the footnodes.
  var i;
  var noteholder = document.getElementById("footnotes");
  if (!noteholder) {
    return;
  }
  var entriesToRemove = [];
  for (i = 0; i < noteholder.childNodes.length; i++) {
    var entry = noteholder.childNodes[i];
    if (entry.nodeName.toLowerCase() == 'div' && entry.getAttribute("class") == "footnote")
      entriesToRemove.push(entry);
  }
  for (i = 0; i < entriesToRemove.length; i++) {
    noteholder.removeChild(entriesToRemove[i]);
  }

  // Rebuild footnote entries.
  var cont = document.getElementById("content");
  var spans = cont.getElementsByTagName("span");
  var refs = {};
  var n = 0;
  for (i=0; i<spans.length; i++) {
    if (spans[i].className == "footnote") {
      n++;
      var note = spans[i].getAttribute("data-note");
      if (!note) {
        // Use [\s\S] in place of . so multi-line matches work.
        // Because JavaScript has no s (dotall) regex flag.
        note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
        spans[i].innerHTML =
          "[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
          "' title='View footnote' class='footnote'>" + n + "</a>]";
        spans[i].setAttribute("data-note", note);
      }
      noteholder.innerHTML +=
        "<div class='footnote' id='_footnote_" + n + "'>" +
        "<a href='#_footnoteref_" + n + "' title='Return to text'>" +
        n + "</a>. " + note + "</div>";
      var id =spans[i].getAttribute("id");
      if (id != null) refs["#"+id] = n;
    }
  }
  if (n == 0)
    noteholder.parentNode.removeChild(noteholder);
  else {
    // Process footnoterefs.
    for (i=0; i<spans.length; i++) {
      if (spans[i].className == "footnoteref") {
        var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
        href = href.match(/#.*/)[0];  // Because IE return full URL.
        n = refs[href];
        spans[i].innerHTML =
          "[<a href='#_footnote_" + n +
          "' title='View footnote' class='footnote'>" + n + "</a>]";
      }
    }
  }
},

install: function(toclevels) {
  var timerId;

  function reinstall() {
    asciidoc.footnotes();
    if (toclevels) {
      asciidoc.toc(toclevels);
    }
  }

  function reinstallAndRemoveTimer() {
    clearInterval(timerId);
    reinstall();
  }

  timerId = setInterval(reinstall, 500);
  if (document.addEventListener)
    document.addEventListener("DOMContentLoaded", reinstallAndRemoveTimer, false);
  else
    window.onload = reinstallAndRemoveTimer;
}

}
asciidoc.install();
/*]]>*/
</script>
</head>
<body class="article">
<div id="header">
<h1>Gene Structure Annotation and Analysis Using PASA</h1>
</div>
<div id="content">
<div id="preamble">
<div class="sectionbody">
<div class="paragraph"><p><span class="image">
<img src="images/PASA_logo.jpg" alt="PASA_logo" height="75" />
</span></p></div>
<div class="paragraph"><p>PASA (Program to Assemble Spliced Alignments) is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures and to keep gene structure annotations consistent with the most recently available experimental sequence data.  PASA also identifies and classifies all splicing variations supported by the transcript alignments.</p></div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">Now available: A hybrid approach to transcript reconstruction using genome-guided and de novo RNA-Seq assemblies to generate a <a href="#A_ComprehensiveTranscriptome">comprehensive transcript database</a>.</td>
</tr></table>
</div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">PASA2 was officially released on June 5th, 2013. PASA2 includes many enhancements over the original PASA, including extensive use of multi-threading for increased runtime performance, a modified MySQL database structure that stores multiple high-quality transcript alignments for use in transcript assembly, and improved integration of Trinity to support RNA-Seq based genome annotation and generation of a comprehensive transcriptome database.</td>
</tr></table>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_table_of_contents">Table of Contents</h2>
<div class="sectionbody">
<div class="ulist"><ul>
<li>
<p>
<a href="#A_intro">Introduction</a>
</p>
</li>
<li>
<p>
<a href="#A_annotPipe">PASA in the Context of a Complete Eukaryotic Annotation Pipeline</a>
</p>
</li>
<li>
<p>
<a href="#A_sys_overview">System Overview</a>
</p>
</li>
<li>
<p>
<a href="#A_obt_pasa">Obtaining PASA</a>
</p>
</li>
<li>
<p>
<a href="#A_sii">Software Installation Instructions</a>
</p>
</li>
<li>
<p>
<a href="#A_rcdaap">Running the Alignment Assembly Pipeline</a>
</p>
</li>
<li>
<p>
<a href="#A_RNASeq">Leveraging RNA-Seq by the PASA Pipeline</a>
</p>
</li>
<li>
<p>
<a href="#A_ComprehensiveTranscriptome">Build a comprehensive transcriptome database using genome-guided and de novo RNA-Seq assembly</a>
</p>
</li>
<li>
<p>
<a href="#A_acau">Annotation Comparisons and Annotation Updates</a>
</p>
</li>
<li>
<p>
<a href="#A_tourWebPortal">Tour of the PASA web portal</a>
</p>
</li>
<li>
<p>
<a href="#A_polya">Polyadenylation Sites Mapped to the Genome</a>
</p>
</li>
<li>
<p>
<a href="#A_alt_splice">Identification and Classification of All Alternative Splicing Variations</a>
</p>
</li>
<li>
<p>
Other useful applications:
</p>
<div class="ulist"><ul>
<li>
<p>
<a href="#A_train">Extraction of ORFs from PASA assemblies (transcriptome-based auto-annotation and/or reference ORFs for training gene predictors)</a>
</p>
</li>
<li>
<p>
<a href="#A_oiaa">Alignment assembly using simple text files as input</a>
</p>
</li>
</ul></div>
</li>
<li>
<p>
<a href="#A_reference">References</a>
</p>
</li>
<li>
<p>
<a href="#A_MailingLists">Mailing Lists</a>
</p>
</li>
</ul></div>
</div>
</div>
<div class="sect1">
<h2 id="A_intro">Introduction</h2>
<div class="sectionbody">
<div class="paragraph"><p>PASA was originally developed at <a href="http://www.tigr.org">The Institute for Genomic Research</a> in 2002 as an effort to automatically improve gene structures in Arabidopsis thaliana. Since then, it has been applied to numerous eukaryotic genome annotation projects, including rice, Aspergillus species, Plasmodium falciparum, Schistosoma mansoni, Aedes aegypti, mouse, and human, among others.</p></div>
<div class="paragraph"><p>Functions of PASA include:</p></div>
<div class="ulist"><ul>
<li>
<p>
model complete and partial gene structures based on assembled spliced alignments.
</p>
</li>
<li>
<p>
automatically incorporate gene structures based on transcript alignments into existing gene structure annotations, thereby maintaining annotations consistent with experimental evidence.  Annotation updates include
</p>
<div class="ulist"><ul>
<li>
<p>
annotating untranslated regions (UTRs)
</p>
</li>
<li>
<p>
exon additions, deletions, boundary adjustments
</p>
</li>
<li>
<p>
addition of models for alternative splicing variants
</p>
</li>
<li>
<p>
merging genes
</p>
</li>
<li>
<p>
splitting genes
</p>
</li>
<li>
<p>
modeling novel genes
</p>
</li>
</ul></div>
</li>
<li>
<p>
map polyadenylation sites to the genome
</p>
</li>
<li>
<p>
identify antisense transcripts
</p>
</li>
<li>
<p>
identify and classify all found splicing variations
</p>
</li>
<li>
<p>
report a likely set of partial and/or full-length protein-coding genes based on transcript alignments for training ab initio gene prediction tools.
</p>
</li>
</ul></div>
<div class="paragraph"><p>PASA is composed of a pipeline of utilities that perform the following ordered set of tasks:</p></div>
<div class="ulist"><ul>
<li>
<p>
cleaning the transcripts
</p>
<div class="ulist"><ul>
<li>
<p>
The seqclean utility, developed by the TIGR Gene Index group, is used to identify evidence of polyadenylation and strip the poly-A, trim vector, and discard low quality sequences.
</p>
</li>
</ul></div>
</li>
<li>
<p>
mapping and aligning transcripts to the genome
</p>
<div class="ulist"><ul>
<li>
<p>
GMAP and/or BLAT is used to map and align the transcripts to the genome.
</p>
</li>
</ul></div>
</li>
<li>
<p>
Validate nearly perfect alignments
</p>
<div class="ulist"><ul>
<li>
<p>
PASA utilizes only near-perfect alignments.  These alignments are required to align with a specified percent identity (typically 95%) along a specified percent of the transcript length (typically 90%).  Each alignment is also required to have consensus splice sites at all inferred intron boundaries: a GT or GC donor with an AG acceptor, or the AT-AC U12-type dinucleotide pair.
</p>
</li>
</ul></div>
</li>
<li>
<p>
Maximal assembly of spliced alignments
</p>
<div class="ulist"><ul>
<li>
<p>
The valid transcript alignments are clustered based on genome mapping location and assembled into gene structures that include the maximal number of compatible transcript alignments.  Compatible alignments are those that have identical gene structures in their region of overlap.  The products are termed PASA maximal alignment assemblies.  Those assemblies that contain at least one full-length cDNA are termed FL-assemblies; the rest are non-FL-assemblies.
</p>
</li>
</ul></div>
</li>
<li>
<p>
Grouping alternatively spliced isoforms
</p>
<div class="ulist"><ul>
<li>
<p>
Alignment assemblies that map to the same genomic locus, significantly overlap, and are transcribed on the same strand, are grouped into clusters of assemblies.
</p>
</li>
</ul></div>
</li>
<li>
<p>
Automatic Genome Annotation
</p>
<div class="ulist"><ul>
<li>
<p>
Given a set of existing gene structure annotations, which may include the latest annotation for a given genome or the results of a single ab initio gene finder, a comparison to the PASA alignment assemblies is performed.  Each alignment assembly is assigned a status identifier based on the results of the annotation comparison. The status identifier indicates whether or not the update is sanctioned as likely to improve the annotation, and the type of update that the assembly provides.  There are over 40 different status identifiers, which amount to about 20 distinct cases, since half correspond to FL-assemblies and the other half to non-FL-assemblies.
</p>
</li>
<li>
<p>
In the absence of any preexisting gene annotations, novel genes and alternative splicing isoforms of novel genes can be modeled.
</p>
</li>
<li>
<p>
At any time, regardless of any existing annotations, users can obtain candidate gene structures based on the longest open reading frame (ORF) found within each PASA alignment assembly.  The output includes a FASTA file for the proteins and a GFF3 file describing the gene structures.  This is useful when applied to a previously uncharacterized genome sequence, allowing one to rapidly obtain a set of candidate gene structures for training various ab initio gene prediction programs.  In the case of RNA-Seq, PASA can generate a full transcriptome-based genome annotation, identifying likely coding and non-coding transcripts.
</p>
</li>
</ul></div>
</li>
</ul></div>
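<div class="paragraph"><p>The alignment validation criteria listed above can be illustrated with a small shell sketch.  It is not PASA's implementation: the tab-delimited summary (alignment id, percent identity, percent of transcript length aligned, and a yes/no flag for consensus splice sites at every intron) is a hypothetical format invented for this example, and the thresholds are the typical values quoted above.</p></div>
<div class="listingblock">
<div class="content">
<pre><code># Hypothetical input columns: id, percent identity, percent aligned, consensus splice sites (yes/no)
printf 'aln1\t98.2\t95.0\tyes\naln2\t93.0\t99.0\tyes\naln3\t99.0\t91.5\tno\n' |
awk -F'\t' -v min_identity=95 -v min_aligned=90 '
    # keep an alignment only if it passes all three checks
    $2 &gt;= min_identity &amp;&amp; $3 &gt;= min_aligned &amp;&amp; $4 == "yes" { print $1 }
'</code></pre>
</div></div>
<div class="paragraph"><p>Only aln1 passes here: aln2 falls below the identity threshold, and aln3 has an intron lacking consensus splice sites.</p></div>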
<div class="sect2">
<h3 id="A_annotPipe">PASA in the Context of a Complete Eukaryotic Annotation Pipeline</h3>
<div class="paragraph"><p>PASA is only one component of a larger eukaryotic annotation pipeline.  Comprehensive genome annotation relies on more than transcript sequence evidence.  Not all genes are expressed under the conditions assessed, and some genes are expressed at low levels, which complicates their discovery and proper annotation.  Other forms of evidence are required for comprehensive genome annotation, including ab initio gene predictors and homology to proteins previously discovered in other sequenced genomes.  A complete annotation pipeline, as implemented at the <a href="http://broadinstitute.org">Broad Institute</a>, involves the following steps:</p></div>
<div class="ulist"><ul>
<li>
<p>
(A) ab initio gene finding using a selection of the following software tools: <a href="http://exon.biology.gatech.edu/">GeneMarkHMM</a>, <a href="http://linux1.softberry.com/berry.phtml?topic=index&amp;group=programs&amp;subgroup=gfind">FGENESH</a>, <a href="http://augustus.gobics.de/">Augustus</a>, <a href="http://homepage.mac.com/iankorf/">SNAP</a>, and <a href="http://www.cbcb.umd.edu/software/GlimmerHMM/">GlimmerHMM</a>.
</p>
</li>
<li>
<p>
(B) protein homology detection and intron resolution using the <a href="http://www.ebi.ac.uk/Tools/Wise2/index.html">GeneWise</a> software and the <a href="http://www.ebi.ac.uk/uniref/">uniref90</a> non-redundant protein database.
</p>
</li>
<li>
<p>
(C) alignment of known ESTs, full-length cDNAs, and most recently, <a href="http://trinityrnaseq.sf.net">Trinity</a> RNA-Seq assemblies to the genome.
</p>
</li>
<li>
<p>
(D) PASA alignment assemblies based on overlapping transcript alignments from step (C)
</p>
</li>
<li>
<p>
(E) use of <a href="http://evidencemodeler.sf.net">EVidenceModeler (EVM)</a> to compute weighted consensus gene structure annotations based on the above (A, B, C, D)
</p>
</li>
<li>
<p>
(F) use of PASA to update the EVM consensus predictions, adding UTR annotations and models for alternatively spliced isoforms (leveraging D and E).
</p>
</li>
<li>
<p>
(G) limited manual refinement of genome annotations (F) using <a href="http://www.broadinstitute.org/annotation/argo/">Argo</a> or <a href="http://apollo.berkeleybop.org/current/index.html">Apollo</a>
</p>
</li>
</ul></div>
<div class="paragraph"><p>The following review of eukaryotic genome annotation methods describes in detail the use of PASA in the context of a more complete eukaryotic genome annotation system: <a href="http://130.88.242.202/medicine/Aspergillus/articlesoverflow/22059117.pdf">Haas et al., Mycology. 2011 Oct 3;2(3):118-141</a>.</p></div>
<div class="paragraph"><p>Both applications of PASA are described below: first assembling transcript alignments into PASA alignment assemblies, and then using those PASA assemblies to update EVM consensus (or other) annotations.</p></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="A_sys_overview">System Overview</h2>
<div class="sectionbody">
<div class="paragraph"><p>PASA runs on UNIX/Linux-based systems (including Mac OS X).  PASA includes components written in Perl and C++.  Utilities used by PASA, including GMAP, are wrapped by Perl code.  Results are provided in summary text files, using standard formats such as GTF, GFF3, BED, and FASTA where applicable.  Results are further available for analysis using the companion suite of Web-based tools and command-line utilities.  Running PASA to generate alignment assemblies requires only two inputs:  the target genome in FASTA format and the input transcripts (ESTs, de novo RNA-Seq assemblies, etc.) in FASTA format.</p></div>
<div class="paragraph"><p>In order to compare the assemblies to existing gene structure annotations, and optionally enhance known structures by adding UTRs, alt-splice variants, and exon adjustments, preexisting gene structure annotations can be provided in GFF3 format, or imported by a user-customized data adapter (described below).</p></div>
<div class="paragraph"><p>Sample data and a preconfigured complete PASA pipeline are available for demonstration purposes, all included in the software distribution.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="A_obt_pasa">Obtaining PASA</h2>
<div class="sectionbody">
<div class="paragraph"><p><a href="http://sourceforge.net/projects/pasa">Download</a> the latest version of the PASA software directly from SourceForge.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="A_sii">Software Installation Instructions</h2>
<div class="sectionbody">
<div class="sect2">
<h3 id="A_psc">Prerequisite Software Components</h3>
<div class="paragraph"><p>In addition to the PASA software obtained here, you will need the following:</p></div>
<div class="ulist"><ul>
<li>
<p>
Relational Database
</p>
<div class="ulist"><ul>
<li>
<p>
MySQL (<a href="http://www.mysql.com">www.mysql.com</a>)
</p>
<div class="olist arabic"><ol class="arabic">
<li>
<p>
create a user/password with read-only access
</p>
</li>
<li>
<p>
create a user/password with all privileges
</p>
</li>
</ol></div>
</li>
</ul></div>
</li>
<li>
<p>
Perl Modules from CPAN (<a href="http://www.cpan.org">www.cpan.org</a>):
</p>
<div class="ulist"><ul>
<li>
<p>
DBD::mysql
</p>
</li>
</ul></div>
</li>
<li>
<p>
Bioinformatics Tools:
</p>
<div class="ulist"><ul>
<li>
<p>
Tom Wu&#8217;s <a href="http://research-pub.gene.com/gmap/">GMAP</a> cDNA alignment utility.
</p>
</li>
<li>
<p>
Jim Kent&#8217;s <a href="http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc35.zip">BLAT</a> aligner
</p>
</li>
<li>
<p>
Bill Pearson&#8217;s <a href="http://faculty.virginia.edu/wrpearson/fasta/fasta3/CURRENT.tar.gz">FASTA</a> general sequence alignment utility.   Note that the fasta utility is bundled with other utilities as part of the Fasta3 suite.  The fasta utility (named, for example, fasta35) should be renamed (or symlinked) to <em>fasta</em>.  This utility is required for annotation comparisons, but not needed for alignment assembly or alt-splicing analysis.
</p>
</li>
</ul></div>
</li>
</ul></div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">The utilities provided by each software package above should be available via your PATH setting.</td>
</tr></table>
</div>
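<div class="paragraph"><p>A minimal shell sketch of this setup follows.  The account names (pasa_ro, pasa_rw), the passwords, the localhost host, the choice to grant privileges on all databases, and the $HOME/bin directory are placeholders and assumptions for illustration, not PASA requirements; adapt them to your site's policy.</p></div>
<div class="listingblock">
<div class="content">
<pre><code># Create the two MySQL accounts (prompts for the MySQL root password)
mysql -u root -p -e "
  CREATE USER 'pasa_ro'@'localhost' IDENTIFIED BY 'change_me_ro';
  GRANT SELECT ON *.* TO 'pasa_ro'@'localhost';
  CREATE USER 'pasa_rw'@'localhost' IDENTIFIED BY 'change_me_rw';
  GRANT ALL PRIVILEGES ON *.* TO 'pasa_rw'@'localhost';"

# Confirm the Perl database driver loads
perl -MDBD::mysql -e 'print "DBD::mysql OK\n"'

# Expose the Fasta3 binary under the name fasta (fasta35 is the example name
# used above; $HOME/bin is assumed to exist and to be on PATH)
ln -s "$(command -v fasta35)" "$HOME/bin/fasta"

# Confirm the aligners are reachable via PATH
for tool in gmap blat fasta; do
    command -v "$tool" &gt;/dev/null || echo "not found in PATH: $tool"
done</code></pre>
</div></div>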
</div>
<div class="sect2">
<h3 id="A_upd">Unpacking the PASA distribution</h3>
<div class="paragraph"><p>Move the PASA distribution to a location on your filesystem, such as /usr/local/bin/PASA.  Henceforth, we&#8217;ll refer to this location as $PASAHOME.</p></div>
<div class="paragraph"><p>Build the components of PASA that require compilation by running:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>make</code></pre>
</div></div>
<div class="paragraph"><p>in the $PASAHOME directory.  This will build the utilities: pasa, slclust, cdbyank, and cdbfasta, and place them in the $PASAHOME/bin directory.</p></div>
<div class="ulist"><ul>
<li>
<p>
Optional: <strong>seqclean</strong>  $PASAHOME/seqclean provides the seqclean software developed by the TIGR Gene Index Group and distributed along with PASA by permission of John Quackenbush.  This is needed for cleaning EST sequences and identifying candidate polyadenylation sites.  Install the software by following the instructions provided.
</p>
</li>
</ul></div>
</div>
<div class="sect2">
<h3 id="A_ccdpp">Configuring the PASA Pipeline</h3>
<div class="paragraph"><p>After installing each of the software tools above, all that is needed before running PASA is to configure it.  The PASA configuration relies on the file:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>$PASAHOME/pasa_conf/conf.txt</code></pre>
</div></div>
<div class="paragraph"><p>A template configuration file is provided at:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>$PASAHOME/pasa_conf/pasa.CONFIG.template</code></pre>
</div></div>
<div class="paragraph"><p>Simply copy pasa.CONFIG.template to conf.txt and set the values for your MySQL database settings.  You need only concern yourself with the following values:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)</code></pre>
</div></div>
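<div class="paragraph"><p>Before launching PASA, it can help to confirm that none of these values was left unset.  The following is a minimal Python sketch, assuming that conf.txt holds plain KEY=VALUE lines, that lines starting with # are comments, and that an unedited value still looks like a parenthesized placeholder as shown above.  The placeholder style in the real template may differ, so treat the function names and checks as illustrative rather than as part of PASA.</p></div>
<div class="listingblock">
<div class="content">
<pre><code># Keys named in the configuration section above.
REQUIRED_KEYS = [
    "PASA_ADMIN_EMAIL",
    "MYSQLSERVER",
    "MYSQL_RO_USER",
    "MYSQL_RO_PASSWORD",
    "MYSQL_RW_USER",
    "MYSQL_RW_PASSWORD",
]


def parse_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments (assumed syntax)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def unset_keys(settings):
    """Return required keys that are absent, empty, or still a '(placeholder)'."""
    problems = []
    for key in REQUIRED_KEYS:
        value = settings.get(key, "")
        if not value or value.startswith("("):
            problems.append(key)
    return problems


if __name__ == "__main__":
    example = "PASA_ADMIN_EMAIL=me@example.org\nMYSQLSERVER=localhost\n"
    print(unset_keys(parse_conf(example)))</code></pre>
</div></div>
<div class="paragraph"><p>In practice, you would read the text from $PASAHOME/pasa_conf/conf.txt instead of the inline example string.</p></div>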
</div>
<div class="sect2">
<h3 id="A_pwp">Setting Up the PASA Web Portal (optional, but highly recommended)</h3>
<div class="paragraph"><p>The PASA web portal provides a number of useful reports, search capabilities, and visualizations that can help with exploring the PASA assemblies and proposed annotation updates.  Visit the <a href="#A_tourWebPortal">Tour of the PASA web portal</a> for examples.</p></div>
<div class="paragraph"><p>The PASA web portal requires a webserver such as Apache (<a href="http://www.apache.org">www.apache.org</a>) and the <a href="http://search.cpan.org/dist/GD/">GD</a> Perl module to be installed.</p></div>
<div class="paragraph"><p>To install the web portal code, recursively copy (cp -r) the $PASAHOME area to the cgi-bin directory of your webserver.  Change permissions on everything so that it is world executable (e.g., % chmod -R 755 ./PASA).
Now, visit the URL for the status report page for the pasa database you created during the pasa run above.</p></div>
<div class="paragraph"><p><a href="http://yourServerName/cgi-bin/PASA/cgi-bin/status_report.cgi?db=$mysqldb">http://yourServerName/cgi-bin/PASA/cgi-bin/status_report.cgi?db=$mysqldb</a></p></div>
<div class="paragraph"><p>This will provide some summary statistics and links to additional web-based utilities for navigating the results from your pasa run.</p></div>
<div class="paragraph"><p>Now that you have the base URL for your PASA web portal, update your original configuration file at:
$PASAHOME/pasa_conf/conf.txt
to set the value of
BASE_PASA_URL=http://yourServerName/cgi-bin/PASA/cgi-bin/</p></div>
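<div class="paragraph"><p>The relationship between the BASE_PASA_URL value and the status report address shown above can be sketched in a few lines of Python.  The function name is hypothetical, and the server and database names are the placeholders used in this document.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>from urllib.parse import urlencode


def status_report_url(base_pasa_url, mysql_db):
    """Build the status report URL for a PASA database name."""
    # Make sure the base ends with exactly one slash before appending.
    base = base_pasa_url.rstrip("/") + "/"
    return base + "status_report.cgi?" + urlencode({"db": mysql_db})


if __name__ == "__main__":
    print(status_report_url(
        "http://yourServerName/cgi-bin/PASA/cgi-bin/", "sample_mydb_pasa"
    ))</code></pre>
</div></div>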
<div class="paragraph"><p>For more info, visit the <a href="#A_tourWebPortal">Tour of the PASA web portal</a></p></div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="A_rcdaap">Running the Alignment Assembly Pipeline</h2>
<div class="sectionbody">
<div class="ulist"><ul>
<li>
<p>
As input to the command-line driven PASA pipeline, we need only two (potentially three) input files.
</p>
<div class="olist arabic"><ol class="arabic">
<li>
<p>
The genome sequence in a multiFasta file (e.g., genome.fasta)
</p>
</li>
<li>
<p>
The transcript sequences in a multiFasta file (e.g., transcripts.fasta)
</p>
</li>
<li>
<p>
Optional: a file containing the list of accessions corresponding to full-length cDNAs (e.g., FL_accs.txt)
</p>
</li>
</ol></div>
</li>
</ul></div>
<div class="sect2">
<h3 id="A_sa">Step A: cleaning the transcript sequences [Optional, requires seqclean to be installed]</h3>
<div class="paragraph"><p>Have each of these files in the same <em>working</em> directory.  Then, run the seqclean utility on your transcripts like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% seqclean  transcripts.fasta</code></pre>
</div></div>
<div class="paragraph"><p>If you have a database of vector sequences (e.g., <a href="http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html">UniVec</a>), you can screen for vector as part of the cleaning process by running the following instead:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% seqclean  transcripts.fasta -v /path/to/your/vectors.fasta</code></pre>
</div></div>
<div class="paragraph"><p>This will generate several output files, including transcripts.fasta.cln and transcripts.fasta.clean.
Both of these can be used as inputs to PASA.</p></div>
</div>
<div class="sect2">
<h3 id="A_wtce">Step B: Walking Through a Complete Example Using the Provided Sample Data</h3>
<div class="paragraph"><p>Sample inputs are provided in the $PASAHOME/sample_data directory.  We&#8217;ll use these inputs to demonstrate the breadth of the software application, including using sample DATA ADAPTERs to import existing gene annotations into the database and to export tentative structural updates.</p></div>
<div class="paragraph"><p>The PASA pipeline requires separate configuration files for the alignment assembly step and the later annotation comparison step.  These are configured separately for each run of the PASA pipeline and set the parameters used by the various tools and processes executed within it.  Configuration file templates are provided as <em>$PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt</em> and  <em>$PASAHOME/pasa_conf/pasa.annotationCompare.Template.txt</em>, and these will be further described when used below.</p></div>
<div class="paragraph"><p>The next steps explain the current contents of the sample_data directory. You do NOT need to redo these operations:</p></div>
<div class="ulist"><ul>
<li>
<p>
I&#8217;ve copied the ../pasa_conf/pasa.alignAssembly.Template.txt to alignAssembly.config and edited the pasa database name to <em>sample_mydb_pasa</em>.
</p>
</li>
<li>
<p>
My required input files exist as: genome_sample.fasta, all_transcripts.fasta, and since I have some full-length cDNAs, I&#8217;m including <em>FL_accs.txt</em> to identify these as such.
</p>
</li>
<li>
<p>
I already ran seqclean to generate files: all_transcripts.fasta.clean and all_transcripts.fasta.cln
</p>
</li>
</ul></div>
<div class="paragraph"><p>You must execute the following steps in order to demonstrate the software. (The impatient can execute the entire pipeline below by running <em>./run_sample_pipeline.pl</em>.  If this is your first time through, it helps to walk through the steps below instead.)</p></div>
<div class="sect3">
<h4 id="A_tafaa">Transcript alignments followed by alignment assembly</h4>
<div class="ulist"><ul>
<li>
<p>
Run the PASA alignment assembly pipeline like so:
</p>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta \
 -t all_transcripts.fasta.clean -T -u all_transcripts.fasta -f FL_accs.txt --ALIGNERS blat,gmap --CPU 2</code></pre>
</div></div>
</li>
</ul></div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">The <em>--ALIGNERS</em> option can take the values <em>gmap</em>, <em>blat</em>, or <em>gmap,blat</em>, in which case both aligners will be executed in parallel.  The <em>--CPU</em> setting determines the number of threads to be used for each process. This is passed on to GMAP to indicate the thread count. In the case of BLAT, the transcript database is split into CPU number of partitions and each partition is searched separately and in parallel using BLAT.  Also, note that if <em>gmap,blat</em> is specified, then you may have up to 2*CPU number of processes running simultaneously.</td>
</tr></table>
</div>
<div class="paragraph"><p>This executes the following operations, generating the corresponding output files:</p></div>
<div class="ulist"><ul>
<li>
<p>
aligns the all_transcripts.fasta file to genome_sample.fasta using the specified alignment tools.  Files generated include:
</p>
<div class="ulist"><ul>
<li>
<p>
<em>sample_mydb_pasa.validated_transcripts.gff3,.gtf,.bed</em> :the valid alignments
</p>
</li>
<li>
<p>
<em>sample_mydb_pasa.failed_gmap_alignments.gff3,.gtf,.bed</em> :the alignments that fail validation test
</p>
</li>
<li>
<p>
<em>alignment.validations.output</em> :tab-delimited format describing the alignment validation results
</p>
</li>
</ul></div>
</li>
</ul></div>
<div class="ulist"><ul>
<li>
<p>
the valid alignments are clustered into piles based on genome alignment position and piles are assembled using the PASA alignment assembler.  Files generated include:
</p>
<div class="ulist"><ul>
<li>
<p>
<em>sample_mydb_pasa.assemblies.fasta</em> :the PASA assemblies in FASTA format.
</p>
</li>
<li>
<p>
<em>sample_mydb_pasa.pasa_assemblies.gff3,.gtf,.bed</em> :the PASA assembly structures.
</p>
</li>
<li>
<p>
<em>sample_mydb_pasa.pasa_alignment_assembly_building.ascii_illustrations.out</em> :descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments.
</p>
</li>
<li>
<p>
<em>sample_mydb_pasa.pasa_assemblies_described.txt</em> :tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure.
</p>
</li>
</ul></div>
</li>
</ul></div>
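<div class="paragraph"><p>The note above on <em>--ALIGNERS</em> and <em>--CPU</em> describes splitting the transcript database into CPU partitions for BLAT and an upper bound of 2*CPU simultaneous processes.  The following Python sketch illustrates that arithmetic only.  How PASA actually assigns sequences to partitions is not described here, so the round-robin scheme and both function names are assumptions made for illustration.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>def partition(records, cpu):
    """Split records into `cpu` partitions by round-robin assignment (assumed scheme)."""
    parts = [[] for _ in range(cpu)]
    for index, record in enumerate(records):
        parts[index % cpu].append(record)
    return parts


def max_processes(aligners, cpu):
    """Upper bound on simultaneous processes: CPU for each aligner run in parallel."""
    return len(aligners) * cpu


if __name__ == "__main__":
    transcripts = ["t1", "t2", "t3", "t4", "t5"]
    print(partition(transcripts, 2))
    print(max_processes(["blat", "gmap"], 2))</code></pre>
</div></div>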
</div>
<div class="sect3">
<h4 id="A_acau">Annotation Comparisons and Annotation Updates</h4>
<div class="paragraph"><p><strong>Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models</strong></p></div>
<div class="paragraph"><p>The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and models for alternative splicing based on the PASA alignment assemblies generated above.</p></div>
<div class="sect4">
<h5 id="A_gapmd">Loading your preexisting protein-coding gene annotations</h5>
<div class="paragraph"><p>Comparing to and updating existing gene structure annotations requires that we import these annotations into the PASA database and are later able to extract the suggested updates.  PASA utilizes annotation data adapters to achieve this.  GFF3 data adapters are included in the PASA distribution, but you can write your own and directly tie the PASA pipeline to your own informatics infrastructure (e.g., another relational database).   If you&#8217;d prefer not to use GFF3 and to write your own data adapters, visit the <a href="PASA_data_adapters.html">PASA data adapter cookbook</a>.</p></div>
<div class="paragraph"><p>A sample gff3-formatted annotation file is provided in our sample_data directory as <strong>orig_annotations_sample.gff3</strong> and can be loaded like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g genome_sample.fasta -P orig_annotations_sample.gff3</code></pre>
</div></div>
<div class="paragraph"><p>Before loading your own GFF3-formatted annotation files, be sure to check them for PASA compatibility like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../misc_utilities/pasa_gff3_validator.pl orig_annotations_sample.gff3</code></pre>
</div></div>
<div class="paragraph"><p>The above gff3-validator will report any entries in your gff3 file that it does not recognize, understand, or otherwise parse properly.  It&#8217;s not a general purpose gff3-validator since it cares only about your protein-coding genes.  (note that you should only feed protein-coding genes to PASA using the loader above).</p></div>
</div>
<div class="sect4">
<h5 id="_performing_an_annotation_comparison_and_generating_an_updated_gene_set">Performing an annotation comparison and generating an updated gene set</h5>
<div class="paragraph"><p>Now that the original annotations are loaded, we can perform a comparison of the PASA alignment assemblies to these preexisting gene annotations, to identify cases where updates can be automatically performed to gene structures in order to incorporate the transcript alignments.</p></div>
<div class="paragraph"><p>I&#8217;ve copied the ../pasa_conf/pasa.annotationCompare.Template.txt file to our working directory as <em>annotCompare.config</em>. Then, I replaced the MYSQLDB=&lt;<em>MYSQLDB</em>&gt; line with MYSQLDB=<em>sample_mydb_pasa</em> as before with the alignAssembly.config file.  Notice this config file contains numerous parameters that can be modified to tune the process to any genome of interest.  We&#8217;ll leave these values untouched for now, relying on the defaults used by PASA, and we&#8217;ll revisit parameterization later.  For most purposes, the defaults are well suited.  Run the annotation comparison like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c annotCompare.config -A -g genome_sample.fasta -t all_transcripts.fasta.clean</code></pre>
</div></div>
<div class="paragraph"><p>Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA, and those that remained untouched.  This file will be named <em>${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3</em>, where $pid is the process ID for this annotation comparison computation.</p></div>
<div class="paragraph"><p>You should revisit the status_report.cgi web page as described above under Setting Up the PASA Web Portal.  There, you will be able to navigate the results of the comparison and examine the classifications for annotation updates assigned to each pasa alignment assembly.</p></div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">It usually requires at least two cycles of annotation loading, annotation comparison, and annotation updates in order to maximize the incorporation of transcript alignments into gene structures.  Updates made to gene structures in the first round often lead to the capacity to incorporate additional transcript alignments that did not fit well in the context of the earlier gene structures.   You can use the PASA-updated annotations in the GFF3 file created at the end of the annotation comparison step as input for a subsequent annotation comparison round. All of the results from the separate annotation comparison rounds remain accessible via the PASA web portal (see below).  The sample pipeline execution provided as <em>run_sample_pipeline.pl</em> runs the annotation comparison step twice, leveraging the output from the previous round in the subsequent round.</td>
</tr></table>
</div>
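<div class="paragraph"><p>Because the updated GFF3 file name embeds a process ID, each annotation comparison round writes a differently named file.  The following is a minimal Python sketch for locating the most recent one before starting another round.  It assumes the file naming pattern given above and that the output is written to the working directory; the function name is hypothetical.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>import glob
import os


def latest_updated_gff3(workdir, mysql_db):
    """Return the most recently modified PASA-updated GFF3 file, or None."""
    # Pattern follows ${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3
    pattern = os.path.join(
        workdir, mysql_db + ".gene_structures_post_PASA_updates.*.gff3"
    )
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    print(latest_updated_gff3(".", "sample_mydb_pasa"))</code></pre>
</div></div>
<div class="paragraph"><p>The returned file can then be supplied to the annotation loader via its <em>-P</em> option, in place of orig_annotations_sample.gff3, for the subsequent round.</p></div>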
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="A_RNASeq">Leveraging RNA-Seq by the PASA Pipeline</h2>
<div class="sectionbody">
<div class="paragraph"><p>Illumina RNA-Seq is quickly revolutionizing gene discovery and gene structure annotation in eukaryotes.  Recent enhancements to the PASA pipeline, including advancements in RNA-Seq de novo assembly, now enable it to make use of these data for gene structure annotation.  It is now relatively straightforward to generate strand-specific RNA-Seq data via Illumina.  Given the great utility of strand-specific data in differentiating between sense and antisense transcription, the great depth of transcriptome sequencing coverage, and the prevalence of antisense transcription, strand-specific RNA-Seq data is highly preferred by the PASA pipeline.  PASA can still be used quite effectively with non-strand-specific RNA-Seq, but the execution differs (see below).  The dUTP strand-specific RNA-Seq method by <a href="http://www.ncbi.nlm.nih.gov/pubmed/19620212">Parkhomchuk et al., NAR, 2009</a> is recommended.  For a comparison of strand-specific methods, see <a href="http://www.ncbi.nlm.nih.gov/pubmed/20711195">Comprehensive comparative analysis of strand-specific RNA sequencing methods. by Levin et al, Nat Methods, 2010</a>.</p></div>
<div class="paragraph"><p>The procedure for leveraging RNA-Seq in the PASA pipeline is very straightforward.  First, assemble the RNA-Seq data using our new <a href="http://trinityrnaseq.sf.net">Trinity de novo RNA-Seq assembly software</a>.  The RNA-Seq assembly process can be performed in either a genome-guided (recommended) or genome-free way.  Documentation for Trinity RNA-Seq  assembly (genome-guided or genome-free) is provided at <a href="http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html">http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html</a>.  Instructions for assembly of strand-specific and non-strand-specific RNA-Seq are provided.</p></div>
<div class="sect2">
<h3 id="_strand_specific_rna_seq">Strand-specific RNA-Seq</h3>
<div class="paragraph"><p>In the case of <strong>strand-specific RNA-Seq</strong>, run PASA with the Trinity transcript assemblies as input, including the <em>--transcribed_is_aligned_orient</em> parameter, to indicate that the Trinity transcripts were directionally assembled:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta --ALIGNERS blat,gmap \
   -t Trinity.fasta --transcribed_is_aligned_orient</code></pre>
</div></div>
<div class="paragraph"><p>The above will cluster and assemble alignments with minimal overlap.  If  your gene density is high and you expect transcripts from neighboring genes to often overlap in their UTR regions,  you can perform more stringent clustering of alignments like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta --ALIGNERS blat,gmap \
   -t Trinity.fasta --transcribed_is_aligned_orient \
   --stringent_alignment_overlap 30.0</code></pre>
</div></div>
<div class="paragraph"><p>Alternatively, if you have existing gene structure annotations that are reasonably accurate, you can cluster Trinity assemblies by locus (annotation-informed clustering) and further augment full-length transcript reconstruction from overlapping inchworm assemblies with the following run command:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta --ALIGNERS blat,gmap \
  -t Trinity.fasta --transcribed_is_aligned_orient \
  -L --annots_gff3 coding_gene_annotations.gff3 \
  --gene_overlap 50.0</code></pre>
</div></div>
<div class="sect3">
<h4 id="_non_strand_specific_rna_seq">Non-Strand-specific RNA-Seq</h4>
<div class="paragraph"><p>In the case of non-strand-specific RNA-Seq, simply exclude the <em>--transcribed_is_aligned_orient</em> parameter and run like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g genome_sample.fasta -t Trinity.fasta --ALIGNERS blat,gmap</code></pre>
</div></div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="A_ComprehensiveTranscriptome">Build a Comprehensive Transcriptome Database Using Genome-guided and De novo RNA-Seq Assembly</h2>
<div class="sectionbody">
<div class="paragraph"><p>Depending on the genome and transcriptome samples under study, the genome may provide a limited view into the transcriptome. Our comprehensive transcriptome database-generating pipeline aims to:</p></div>
<div class="ulist"><ul>
<li>
<p>
Capture transcripts for genes missing from the genome (difficult to sequence regions, novel transcripts existing in the sample, etc).
</p>
</li>
<li>
<p>
Capture transcripts that align partially to the genome with exons falling into sequencing gaps.
</p>
</li>
<li>
<p>
Capture transcripts that cannot otherwise be represented properly according to the reference genome due to karyotype differences (ex. genome translocations).
</p>
</li>
</ul></div>
<div class="paragraph"><p>The transcripts are identified and included along with the PASA assemblies yielding a more comprehensive transcriptome database, to be used for downstream investigations into expressed gene content and differential expression analyses.</p></div>
<div class="paragraph"><p>Our system for building the comprehensive transcriptome database requires multiple sources of inputs: 1. <a href="http://trinityrnaseq.sf.net">Trinity de novo</a> RNA-Seq assemblies (ex. Trinity.fasta), 2. <a href="http://trinityrnaseq.sourceforge.net/genome_guided_trinity.html">Trinity genome-guided</a> RNA-Seq assemblies (ex. Trinity.GG.fasta), and (optionally) 3. <a href="http://cufflinks.cbcb.umd.edu/">Cufflinks</a> transcript structures (ex. cufflinks.gtf).</p></div>
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">When applying Trinity to RNA-Seq samples derived from microbial eukaryotes, using either genome-free or genome-guided de novo assembly, be sure to use the <em>--jaccard_clip</em> parameter to reduce the occurrence of falsely-fused genome-neighboring transcripts.  Also, only include Cufflinks transcripts if applying the approach to expansive genomes of animals such as mouse or human, and exclude Cufflinks from application to compact microbial eukaryotic genomes.</td>
</tr></table>
</div>
<div class="paragraph"><p>After generating the inputs according to their separate procedures linked above, you can run PASA according to the following steps:</p></div>
<div class="olist arabic"><ol class="arabic">
<li>
<p>
Concatenate the Trinity.fasta and Trinity.GG.fasta files into a single <em>transcripts.fasta</em> file.
</p>
<div class="literalblock">
<div class="content">
<pre><code>cat Trinity.fasta Trinity.GG.fasta &gt; transcripts.fasta</code></pre>
</div></div>
</li>
<li>
<p>
Create a file containing the list of transcript accessions that correspond to the Trinity de novo assembly (full de novo, <strong>not</strong> genome-guided).
</p>
<div class="literalblock">
<div class="content">
<pre><code>$PASAHOME/misc_utilities/accession_extractor.pl &lt; Trinity.fasta &gt; tdn.accs</code></pre>
</div></div>
</li>
<li>
<p>
Run PASA using RNA-Seq related options as described in the section above, but include the parameter setting <em>--TDN tdn.accs</em>.  To (optionally) include Cufflinks-generated transcript structures, further include the parameter setting <em>--cufflinks_gtf cufflinks.gtf</em>.  Note, Cufflinks may not be appropriate for gene-dense targets, such as in fungi; Cufflinks excels when applied to vertebrate genomes, so best to include when applying to mouse or human.
</p>
</li>
<li>
<p>
After completing the PASA alignment assembly, generate the comprehensive transcriptome database via:
</p>
<div class="literalblock">
<div class="content">
<pre><code>$PASAHOME/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta --min_per_ID 95 --min_per_aligned 30</code></pre>
</div></div>
</li>
</ol></div>
<div class="paragraph"><p>This examines the Trinity de novo assemblies (specified by the --TDN parameter in the PASA run).  The following groupings are performed:</p></div>
<div class="olist loweralpha"><ol class="loweralpha">
<li>
<p>
Those TDN accessions mapping above the <em>--min_per_ID</em> and <em>--min_per_aligned</em> values but otherwise failing the stringent alignment validation requirements (splice sites, contiguity, etc.) are assigned to PASA assembly clusters (genes) based on exon overlap.  Those not mapping to PASA assemblies retain their gene identifier assigned as the Trinity component.  Likewise, those TDN entries that map poorly to the genome (below the <em>--min_per_ID</em> and <em>--min_per_aligned</em> criteria) or do not map to the genome at all are assigned gene identifiers based on the Trinity component identifier.  PASA assemblies and those TDN entries that were not included in PASA assemblies (not mapping or invalid alignments) are reported as a single data set.
</p>
</li>
</ol></div>
<div class="paragraph"><p>The resulting data files should include:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>compreh_init_build/compreh_init_build.fasta                :the transcript sequences
compreh_init_build/compreh_init_build.geneToTrans_mapping  :the gene/transcript mapping file (for use with RSEM, Trinotate, other tools)</code></pre>
</div></div>
<div class="literalblock">
<div class="content">
<pre><code>compreh_init_build/compreh_init_build.bed                  :transcript structures in bed format
compreh_init_build/compreh_init_build.gff3                 :transcript structures in gff3 format</code></pre>
</div></div>
<div class="literalblock">
<div class="content">
<pre><code>compreh_init_build/compreh_init_build.details              :classifications of transcripts according to genome mapping status.</code></pre>
</div></div>
<div class="paragraph"><p>The classifications include:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>pasa  : PASA alignment assembly
InvalidQualityAlignment_YES_PASAmap : invalid alignment that maps at percent identity and alignment length requirement, and overlaps a PASA exon
InvalidQualityAlignment_NO_PASAmap : same as above, but doesn't map to a PASA exon
PoorAlignment_TreatUnmapped : invalid alignment that does not meet percent identity and length requirements (potentially missing from genome)
TDN_noMap : no alignment to the genome reported (missing from the genome).</code></pre>
</div></div>
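<div class="paragraph"><p>The classification labels above follow a simple decision order.  The Python sketch below restates that order for a Trinity de novo (TDN) entry that was not incorporated into a PASA assembly.  It is an illustration of the rules as described in this section, not code taken from PASA, and the function and argument names are hypothetical.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>def classify_tdn(has_alignment, meets_thresholds, overlaps_pasa_exon):
    """Classify a TDN entry that is not part of a PASA assembly.

    has_alignment      -- an alignment to the genome was reported
    meets_thresholds   -- the alignment satisfies --min_per_ID and --min_per_aligned
    overlaps_pasa_exon -- the alignment overlaps an exon of a PASA assembly
    """
    if not has_alignment:
        return "TDN_noMap"
    if not meets_thresholds:
        return "PoorAlignment_TreatUnmapped"
    if overlaps_pasa_exon:
        return "InvalidQualityAlignment_YES_PASAmap"
    return "InvalidQualityAlignment_NO_PASAmap"


if __name__ == "__main__":
    print(classify_tdn(True, True, False))</code></pre>
</div></div>
<div class="paragraph"><p>Only entries in the InvalidQualityAlignment_YES_PASAmap class are assigned to a PASA assembly cluster; the others keep the gene identifier derived from their Trinity component, as described above.</p></div>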
</div>
</div>
<div class="sect1">
<h2 id="A_tourWebPortal">Tour of the PASA web portal</h2>
<div class="sectionbody">
<div class="paragraph"><p>The results from running PASA on our sample data set can be examined via the PASA web portal.  For example purposes, I&#8217;ve saved a few of the reports generated by the PASA web displays.  (Note that the portal generates pages on the fly; the copies linked here are static snapshots provided only as examples.)</p></div>
<div class="ulist"><ul>
<li>
<p>
Summary report for alignment assembly and each annotation comparison: <a href="portalTour/status_report_cgi.html">status_report.html</a>
</p>
</li>
<li>
<p>
Description of an individual alignment assembly as compared to an existing annotation: <a href="portalTour/assemblyReport.html">assembly_report_example.html</a>
</p>
</li>
<li>
<p>
Classification of an alternatively spliced gene: <a href="portalTour/altSpliceReport.html">alt_splice_example.html</a>
</p>
</li>
</ul></div>
<div class="paragraph"><p>The above are just a few examples.  Install the PASA portal and navigate your PASA results.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="A_polya">Polyadenylation Sites Mapped to the Genome</h2>
<div class="sectionbody">
<div class="paragraph"><p>If <strong>seqclean</strong> was used to clean the transcript sequences, and both the cleaned and original transcript databases were provided in the alignment assembly run of the PASA pipeline as described, then the polyadenylation sites evidenced in the original transcript sequences and identified as part of the seqclean process are mapped to the genome.  The termini of the polyadenylated transcripts are compared to the genome, and those transcripts that truly appear to be polyadenylated, rather than resulting from an artifact of internal priming at an A-rich region, are reported as candidate polyA sites.  The genome coordinate reported as the polyA site is the nucleotide to which polyA is added, so it corresponds to the last non-polyA nucleotide of the polyadenylated transcript.  An example of a candidate polyA site can be extracted from one of the log files (default <em>pasa_run.$pid.log/polyAsite_analysis.out</em>) like so:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>// cdna:gi|51968615|dbj|AK175237.1|, annotdb_asmbl_id:68712, polyAcoord:50443, transcribedOrient:+, rend
CGCTTCTTATattacagggt
CGCTTCTTATAAAAAAAAAA       gi|51968615|dbj|AK175237.1|  TransOrient (+)
trimmedSeq:
          AAAAAAAAAA
OK polyA site candidate.</code></pre>
</div></div>
<div class="paragraph"><p>An additional fasta file (default <em>${mysql_db}.polyAsites.fasta</em>) summarizes all mapped polyA sites supported by the transcripts.  A 100 bp segment of the genome sequence is extracted and oriented, and the last nucleotide in uppercase corresponds to the residue to which polyA is added in the processed transcript.  The site corresponding to our example above is as follows:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>&gt;68712-50443_+ 1 transcripts: gi|51968615|dbj|AK175237.1|
ATCGACCACCCTCTTTTTTATAAGTAACTTTTCAAGATAACGCTTCTTATattacagggtctacttccattacaaatgcaataggtttgatggttaataa</code></pre>
</div></div>
<div class="paragraph"><p>The accession is bundled like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>genome_accession - polyA_coordinate _ transcribed_orientation</code></pre>
</div></div>
<div class="paragraph"><p>The rest of the header indicates the number of transcripts supporting this polyA site followed by the list of those transcript accessions.  The examples above were extracted from our sample data set provided.  A more compelling example for Arabidopsis, using spliced transcripts only, is as follows:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>&gt;chr5-506542_- 44 transcripts: gi|86086725|gb|DR382484.1|DR382484,gi|86082384|gb|DR378143.1|DR378143,gi|86082270|gb|DR378029.1|DR378029,gi|86082193|gb|DR377952.1|DR37795
2,gi|86082172|gb|DR377931.1|DR377931,gi|86082156|gb|DR377915.1|DR377915,gi|86082123|gb|DR377882.1|DR377882,gi|86082071|gb|DR377830.1|DR377830,gi|86081971|gb|DR377730.
1|DR377730,gi|86081887|gb|DR377646.1|DR377646,gi|86081885|gb|DR377644.1|DR377644,gi|86081868|gb|DR377627.1|DR377627,gi|86081709|gb|DR377466.1|DR377466,gi|86081657|gb|
DR377414.1|DR377414,gi|86081635|gb|DR377392.1|DR377392,gi|86081559|gb|DR377316.1|DR377316,gi|86081550|gb|DR377307.1|DR377307,gi|86081543|gb|DR377300.1|DR377300,gi|860
81529|gb|DR377286.1|DR377286,gi|86081252|gb|DR377009.1|DR377009,gi|86081247|gb|DR377004.1|DR377004,gi|86081239|gb|DR376996.1|DR376996,gi|86079014|gb|DR374771.1|DR3747
71,gi|86076986|gb|DR372743.1|DR372743,gi|85870703|gb|DR191655.1|DR191655,gi|85869935|gb|DR190887.1|DR190887,gi|85869920|gb|DR190872.1|DR190872,gi|85869608|gb|DR190560
.1|DR190560,gi|85869452|gb|DR190404.1|DR190404,gi|85869353|gb|DR190305.1|DR190305,gi|85869352|gb|DR190304.1|DR190304,gi|85869340|gb|DR190292.1|DR190292,gi|85869337|gb
|DR190289.1|DR190289,gi|85869336|gb|DR190288.1|DR190288,gi|85869335|gb|DR190287.1|DR190287,gi|85869329|gb|DR190281.1|DR190281,gi|85868471|gb|DR189423.1|DR189423,gi|85
867798|gb|DR188750.1|DR188750,gi|85867058|gb|DR188010.1|DR188010,gi|49285508|gb|BP634256.1|BP634256,gi|32888810|gb|CB264037.1|CB264037,gi|32888295|gb|CB263522.1|CB263
522,gi|32885705|gb|CB260932.1|CB260932,gi|32885650|gb|CB260877.1|CB260877
GTTTTATCTTTGTGACTTTATTAATCCTAAGACTATTATGGGTTTGTATTaaagtttgcttctttcttgctcactacacaattaagattcaagcccattg</code></pre>
</div></div>
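<div class="paragraph"><p>The accession layout and the uppercase/lowercase convention described above lend themselves to simple parsing.  The following Python sketch shows one way to do it, assuming the accession token has already been taken from the header line with the leading FASTA header character removed.  Both function names are hypothetical and are not part of PASA.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>def parse_polya_accession(accession):
    """Split 'genome_accession-polyA_coordinate_orientation' into its parts."""
    # Split from the right so that genome accessions containing '-' or '_' survive.
    prefix, _, orient = accession.rpartition("_")
    genome_acc, _, coord = prefix.rpartition("-")
    return genome_acc, int(coord), orient


def polya_site_index(segment):
    """Return the 0-based index of the last uppercase nucleotide in the segment."""
    return max(i for i, base in enumerate(segment) if base.isupper())


if __name__ == "__main__":
    print(parse_polya_accession("68712-50443_+"))
    print(polya_site_index("CGCTTCTTATattacagggt"))</code></pre>
</div></div>
<div class="paragraph"><p>The index returned by the second function marks the residue to which polyA is added within the oriented genome segment; the lowercase bases that follow it are downstream genomic sequence.</p></div>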
<div class="admonitionblock">
<table><tr>
<td class="icon">
<div class="title">Note</div>
</td>
<td class="content">Polyadenylation sites are reported here only when the original transcript sequence shows evidence of polyadenylation.  Other systems examine clusters of transcript alignment termini within windows; this is not <strong>yet</strong> done as part of PASA.  Also, the poly-A analysis modules were built for EST sequencing data and have not yet been updated for use with next-gen RNA-Seq analysis.</td>
</tr></table>
</div>
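<div class="paragraph"><p>For downstream filtering it can help to pull the support count and the accession list out of each header.  The following is a minimal Python sketch and is not part of PASA: the function name <em>parse_polya_header</em> is hypothetical, the example header is a shortened illustration that reuses three accessions from the listing above, and the first header token is kept as an opaque site identifier.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
def parse_polya_header(header):
    """Split a polyA-site FASTA header into (site_id, count, accessions)."""
    text = header.strip()
    if text.startswith(">"):
        text = text[1:]
    # Everything before the first colon reads "SITE_ID COUNT transcripts".
    prefix, _, acc_field = text.partition(":")
    fields = prefix.split()
    site_id = fields[0]
    count = int(fields[1])
    # The remainder is a comma-separated list of transcript accessions.
    accessions = [acc for acc in acc_field.strip().split(",") if acc]
    return site_id, count, accessions


# Shortened, illustrative header (the real example lists 44 transcripts).
example = (">chr5-506542_- 3 transcripts: "
           "gi|32888295|gb|CB263522.1|CB263522,"
           "gi|32885705|gb|CB260932.1|CB260932,"
           "gi|32885650|gb|CB260877.1|CB260877")
site_id, count, accessions = parse_polya_header(example)
print(site_id, count, len(accessions))
```
</code></pre>
</div></div>
<div class="paragraph"><p>Splitting on the first colon assumes that the site identifier itself contains no colon, which holds for the headers shown here but may not hold for every contig naming scheme.</p></div>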
</div>
</div>
<div class="sect1">
<h2 id="A_alt_splice">Identification and Classification of All Alternative Splicing Variations</h2>
<div class="sectionbody">
<div class="paragraph"><p>PASA is well suited to identifying and classifying alternative splicing isoforms, as evidenced by incompatible transcript alignments.  Overlapping alignments are incompatible when they have some structural difference within their overlapping region; because of this incompatibility, they are relegated to different but overlapping alignment assemblies.  PASA performs an all-vs-all comparison among the clustered overlapping alignment assemblies to identify the following categories of splicing variations:</p></div>
<div class="ulist"><ul>
<li>
<p>
alternative donor or acceptor
</p>
</li>
<li>
<p>
retained or spliced intron
</p>
</li>
<li>
<p>
starts or ends in an intron
</p>
</li>
<li>
<p>
skipped or retained exons
</p>
</li>
<li>
<p>
alternate terminal exons
</p>
</li>
</ul></div>
<div class="paragraph"><p>The automated alternative splicing analysis is run as part of the alignment-assembly pipeline.</p></div>
<div class="paragraph"><p>The results are available in the default output files, with examples below shown from the sample_data/ pipeline run:</p></div>
<div class="paragraph"><p>File <em>${mysql_db}.alt_splice_label_combinations.dat</em>: a tab-delimited listing of all unique splicing labels for each PASA alignment assembly labeled with a variation.  For example:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>genome  pasa_acc  assembly_cluster   combinations_of_labels
68711   asmbl_2 1       ends_in_intron
68711   asmbl_6 3       alt_donor
68711   asmbl_4 3       alt_donor
68711   asmbl_10        6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_11        6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_9 6       alt_acceptor, retained_exon, skipped_exon
68711   asmbl_24        14      spliced_intron, starts_in_intron
68711   asmbl_23        14      retained_intron
...</code></pre>
</div></div>
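<div class="paragraph"><p>A quick way to summarize this file is to tally how many assemblies carry each label.  The Python sketch below is illustrative only and is not shipped with PASA; it assumes four tab-separated columns in the order shown above, labels separated by commas in the last column, and an optional header row that starts with <em>genome</em>.  The function name <em>count_splice_labels</em> is hypothetical.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
from collections import Counter


def count_splice_labels(lines):
    """Count how many assemblies carry each alternative-splicing label."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) != 4 or fields[0] == "genome":
            continue
        labels = [label.strip() for label in fields[3].split(",")]
        counts.update(label for label in labels if label)
    return counts


# A few rows copied from the example listing, written with explicit tabs.
sample = [
    "genome\tpasa_acc\tassembly_cluster\tcombinations_of_labels\n",
    "68711\tasmbl_2\t1\tends_in_intron\n",
    "68711\tasmbl_6\t3\talt_donor\n",
    "68711\tasmbl_4\t3\talt_donor\n",
    "68711\tasmbl_10\t6\talt_acceptor, retained_exon, skipped_exon\n",
]
for label, n in sorted(count_splice_labels(sample).items()):
    print(label, n)
```
</code></pre>
</div></div>
<div class="paragraph"><p>To run it on real output, pass an open file handle for the <em>.alt_splice_label_combinations.dat</em> file in place of the <em>sample</em> list.</p></div>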
<div class="paragraph"><p>File <em>${mysql_db}.indiv_splice_labels_and_coords.dat</em> provides the genome coordinates for each alternative splicing label applied to each corresponding pasa alignment assembly.  For example:</p></div>
<div class="listingblock">
<div class="content">
<pre><code>genome_acc  pasa_acc  assembly_cluster altsplice_label  genome_lend genome_rend transcribed_orient list_of_cdnas_supporting_variation
68711   asmbl_10        6       alt_acceptor    35633   35634   -       gi|42468094|emb|BX819464.1|CNS0A8YA
68711   asmbl_11        6       alt_acceptor    35639   35640   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       retained_exon   35448   35498   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42528978|gb|BX835128.1|BX835128
68711   asmbl_11        6       skipped_exon    35448   35498   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       retained_exon   36174   36227   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_11        6       skipped_exon    36174   36227   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_11        6       retained_exon   36268   36309   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       skipped_exon    36268   36309   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_11        6       retained_exon   36879   37028   -       gi|6782248|emb|AJ271597.1|ATH271597
68711   asmbl_10        6       skipped_exon    36879   37028   -       gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42532609|gb|BX838526.1|BX838526
68711   asmbl_10        6       alt_acceptor    35633   35634   -       gi|42468094|emb|BX819464.1|CNS0A8YA
68711   asmbl_9 6       alt_acceptor    35639   35640   -       gi|11125656|emb|AJ294534.1|ATH294534,gi|13398925|emb|AJ276619.1|ATH276619
...</code></pre>
</div></div>
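<div class="paragraph"><p>Because each row carries coordinates, rows can be paired up to find exons that are retained in one assembly and skipped in another.  The following Python sketch is one hedged way to do that; it assumes the eight tab-separated columns named in the header above, and the function name <em>exon_skipping_events</em> is hypothetical rather than part of PASA.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
from collections import defaultdict


def exon_skipping_events(lines):
    """Group retained_exon / skipped_exon rows that share exon coordinates."""
    by_exon = defaultdict(lambda: {"retained_exon": set(), "skipped_exon": set()})
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Ignore the header row, other label types, and malformed lines.
        if len(fields) != 8 or fields[3] not in ("retained_exon", "skipped_exon"):
            continue
        genome, pasa_acc, _cluster, label, lend, rend, orient, _cdnas = fields
        by_exon[(genome, int(lend), int(rend), orient)][label].add(pasa_acc)
    # Keep only exons seen as retained in one assembly and skipped in another.
    return {
        exon: groups
        for exon, groups in by_exon.items()
        if groups["retained_exon"] and groups["skipped_exon"]
    }


# Three rows copied from the example listing, written with explicit tabs.
sample = [
    "68711\tasmbl_10\t6\tretained_exon\t35448\t35498\t-\t"
    "gi|42468094|emb|BX819464.1|CNS0A8YA,gi|42528978|gb|BX835128.1|BX835128",
    "68711\tasmbl_11\t6\tskipped_exon\t35448\t35498\t-\t"
    "gi|6782248|emb|AJ271597.1|ATH271597",
    "68711\tasmbl_10\t6\talt_acceptor\t35633\t35634\t-\t"
    "gi|42468094|emb|BX819464.1|CNS0A8YA",
]
events = exon_skipping_events(sample)
for (genome, lend, rend, orient), groups in events.items():
    print(genome, lend, rend, orient,
          sorted(groups["retained_exon"]), sorted(groups["skipped_exon"]))
```
</code></pre>
</div></div>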
<div class="paragraph"><p>The PASA web portal provides numerous reports, graphs, and illustrations to navigate the results of the automated alternative splicing analysis.</p></div>
</div>
</div>
<div class="sect1">
<h2 id="A_oiaa">Only Interested in Alignment Assembly?</h2>
<div class="sectionbody">
<div class="paragraph"><p>In our current working directory, there&#8217;s a file <em>clusters_of_valid_alignments.txt</em> that contains all the clusters of valid alignments in a simple text format like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>// cluster: number
accession,transcribed_orientation,lend-rend,lend-rend,...
...</code></pre>
</div></div>
<div class="paragraph"><p>The transcribed orientation is +, -, or ?.  The ? orientation should be used only for single-exon transcript alignments for which the orientation of transcription is ambiguous.  By default, PASA assigns all single-exon transcripts that lack evidence of polyadenylation to the ambiguous transcribed orientation.  Given this input file, we can demonstrate the PASA alignment assembler like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% ../scripts/pasa_alignment_assembler_textprocessor.pl &lt; clusters_of_valid_alignments.txt</code></pre>
</div></div>
<div class="paragraph"><p>Each cluster of transcript alignments is assembled separately, and the results are written to stdout with illustrations.</p></div>
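<div class="paragraph"><p>If you want to inspect or generate this input format yourself, it is simple to parse.  The Python sketch below is an illustration and not part of PASA; it assumes that accessions contain no commas and that every alignment line follows a <em>// cluster:</em> header.  The function name <em>parse_clusters</em> is hypothetical, and the sample lines are taken from cluster 52 of the sample data.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
def parse_clusters(lines):
    """Return {cluster_id: [(accession, orient, [(lend, rend), ...]), ...]}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            # Header line of the form "// cluster: 52".
            current = line.split(":", 1)[1].strip()
            clusters[current] = []
            continue
        if current is None:
            raise ValueError("alignment line found before any cluster header")
        accession, orient, *segments = line.split(",")
        coords = []
        for segment in segments:
            lend, rend = segment.split("-")
            coords.append((int(lend), int(rend)))
        clusters[current].append((accession, orient, coords))
    return clusters


sample = [
    "// cluster: 52",
    "gi|14532493|gb|AY039871.1|,-,38468-38715,38808-39953",
    "gi|18655376|gb|AY077666.1|,-,38846-39847",
    "gi|19864228|gb|AV822195.1|AV822195,?,39309-39953",
]
for cluster_id, alignments in parse_clusters(sample).items():
    # Count alignments whose transcribed orientation is ambiguous.
    ambiguous = sum(1 for _acc, orient, _coords in alignments if orient == "?")
    print(cluster_id, len(alignments), ambiguous)
```
</code></pre>
</div></div>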
<div class="paragraph"><p><strong>Example input</strong></p></div>
<div class="listingblock">
<div class="content">
<pre><code>// cluster: 52
gi|14532493|gb|AY039871.1|,-,38468-38715,38808-39953
gi|14532527|gb|AY039888.1|,-,38468-38715,38808-39953
gi|18655376|gb|AY077666.1|,-,38846-39847
gi|19801675|gb|AV782885.1|AV782885,-,38468-38715,38808-39255
gi|19839856|gb|AV805871.1|AV805871,-,38478-38715,38808-38972
gi|19861773|gb|AV819822.1|AV819822,-,38496-38715,38808-39021
gi|19864228|gb|AV822195.1|AV822195,?,39309-39953
gi|21403701|gb|AY084991.1|,-,38331-38715,38912-39950
gi|32362537|gb|CB074156.1|CB074156,?,38866-39212
gi|42467384|emb|BX819813.1|CNS0A8I9,-,38509-38715,38808-39898
gi|42467462|emb|BX820042.1|CNS0A8GI,-,38481-38715,38808-39873
gi|42467544|emb|BX820309.1|CNS0A8LV,-,38509-38715,38808-39907
gi|42467850|emb|BX818822.1|CNS0A905,-,38506-38715,38808-39907
gi|42468073|emb|BX819411.1|CNS0A8VM,-,38495-38715,38912-39907
gi|42468257|emb|BX820772.1|CNS0A8PI,-,38434-38715,38808-39907
gi|49289224|gb|BP637972.1|BP637972,-,38427-38715,38808-38892
gi|56086876|gb|BP562044.2|BP562044,?,39467-39919
gi|58799838|gb|BP779059.1|BP779059,-,38468-38715,38912-39063
gi|59847772|gb|BP811693.1|BP811693,?,39525-39918
gi|59898821|gb|BP837850.1|BP837850,?,39540-39918
gi|86056909|gb|DR352666.1|DR352666,?,39578-39950
gi|86056910|gb|DR352667.1|DR352667,?,39681-39894
gi|86056911|gb|DR352668.1|DR352668,?,39496-39950
gi|86056912|gb|DR352669.1|DR352669,?,39454-39907
gi|86056913|gb|DR352670.1|DR352670,?,39507-39950
gi|86056914|gb|DR352671.1|DR352671,?,39437-39919
gi|86084686|gb|DR380445.1|DR380445,-,38331-38715,38912-39127
gi|8678774|gb|AV519247.1|AV519247,-,38401-38715,38808-38918
gi|8682044|gb|AV522517.1|AV522517,-,38486-38715,38912-39124
gi|8700432|gb|AV538676.1|AV538676,-,38506-38715,38912-39282</code></pre>
</div></div>
<div class="paragraph"><p><strong>Corresponding Output</strong></p></div>
<div class="listingblock">
<div class="content">
<pre><code>Individual Alignments: (30)
  0 --------------&gt;      &lt;---------------------------------------       (a+/s-)gi|21403701|gb|AY084991.1|
  1 --------------&gt;      &lt;--------      (a+/s-)gi|86084686|gb|DR380445.1|DR380445
  2    -----------&gt;   &lt;----     (a+/s-)gi|8678774|gb|AV519247.1|AV519247
  3     ----------&gt;   &lt;---      (a+/s-)gi|49289224|gb|BP637972.1|BP637972
  4     ----------&gt;   &lt;---------------------------------------- (a+/s-)gi|42468257|emb|BX820772.1|CNS0A8PI
  5      ---------&gt;   &lt;------------------------------------------       (a+/s-)gi|14532493|gb|AY039871.1|
  6      ---------&gt;   &lt;------------------------------------------       (a+/s-)gi|14532527|gb|AY039888.1|
  7      ---------&gt;   &lt;---------------- (a+/s-)gi|19801675|gb|AV782885.1|AV782885
  8      ---------&gt;      &lt;------        (a+/s-)gi|58799838|gb|BP779059.1|BP779059
  9      ---------&gt;   &lt;------   (a+/s-)gi|19839856|gb|AV805871.1|AV805871
 10       --------&gt;   &lt;---------------------------------------  (a+/s-)gi|42467462|emb|BX820042.1|CNS0A8GI
 11       --------&gt;      &lt;--------      (a+/s-)gi|8682044|gb|AV522517.1|AV522517
 12       --------&gt;      &lt;------------------------------------- (a+/s-)gi|42468073|emb|BX819411.1|CNS0A8VM
 13       --------&gt;   &lt;-------- (a+/s-)gi|19861773|gb|AV819822.1|AV819822
 14       --------&gt;   &lt;---------------------------------------- (a+/s-)gi|42467850|emb|BX818822.1|CNS0A905
 15       --------&gt;      &lt;--------------        (a+/s-)gi|8700432|gb|AV538676.1|AV538676
 16        -------&gt;   &lt;---------------------------------------- (a+/s-)gi|42467384|emb|BX819813.1|CNS0A8I9
 17        -------&gt;   &lt;---------------------------------------- (a+/s-)gi|42467544|emb|BX820309.1|CNS0A8LV
 18                    --------------------------------------   (a+/s-)gi|18655376|gb|AY077666.1|
 19                     --------------  (a+/s?)gi|32362537|gb|CB074156.1|CB074156
 20                                     -------------------------       (a+/s?)gi|19864228|gb|AV822195.1|AV822195
 21                                          -------------------        (a+/s?)gi|86056914|gb|DR352671.1|DR352671
 22                                           ----------------- (a+/s?)gi|86056912|gb|DR352669.1|DR352669
 23                                           ------------------        (a+/s?)gi|56086876|gb|BP562044.2|BP562044
 24                                            ------------------       (a+/s?)gi|86056911|gb|DR352668.1|DR352668
 25                                             -----------------       (a+/s?)gi|86056913|gb|DR352670.1|DR352670
 26                                             ----------------        (a+/s?)gi|59847772|gb|BP811693.1|BP811693
 27                                              ---------------        (a+/s?)gi|59898821|gb|BP837850.1|BP837850
 28                                               ---------------       (a+/s?)gi|86056909|gb|DR352666.1|DR352666
 29                                                   --------- (a+/s?)gi|86056910|gb|DR352667.1|DR352667


ASSEMBLIES: (2)
       -----------&gt;   &lt;------------------------------------------       (a-/s-)gi|8678774|gb|AV519247.1|AV519247/gi|49289224|gb|BP637972.1|BP637972/gi|42468257|emb|BX820772.1|CNS0A8PI/gi|14532493|gb|AY039871.1|/gi|14532527|gb|AY039888.1|/gi|19801675|gb|AV782885.1|AV782885/gi|19839856|gb|AV805871.1|AV805871/gi|42467462|emb|BX820042.1|CNS0A8GI/gi|19861773|gb|AV819822.1|AV819822/gi|42467850|emb|BX818822.1|CNS0A905/gi|42467384|emb|BX819813.1|CNS0A8I9/gi|42467544|emb|BX820309.1|CNS0A8LV/gi|18655376|gb|AY077666.1|/gi|32362537|gb|CB074156.1|CB074156/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667
    --------------&gt;      &lt;---------------------------------------       (a-/s-)gi|21403701|gb|AY084991.1|/gi|86084686|gb|DR380445.1|DR380445/gi|58799838|gb|BP779059.1|BP779059/gi|8682044|gb|AV522517.1|AV522517/gi|42468073|emb|BX819411.1|CNS0A8VM/gi|8700432|gb|AV538676.1|AV538676/gi|19864228|gb|AV822195.1|AV822195/gi|86056914|gb|DR352671.1|DR352671/gi|86056912|gb|DR352669.1|DR352669/gi|56086876|gb|BP562044.2|BP562044/gi|86056911|gb|DR352668.1|DR352668/gi|86056913|gb|DR352670.1|DR352670/gi|59847772|gb|BP811693.1|BP811693/gi|59898821|gb|BP837850.1|BP837850/gi|86056909|gb|DR352666.1|DR352666/gi|86056910|gb|DR352667.1|DR352667



Assembly(1): orient(a-/s-) align: 38401(1461)-38715(1147)&gt;YY....XX&lt;38808(1146)-39953(1)
Assembly(2): orient(a-/s-) align: 38331(1427)-38715(1043)&gt;YY....XX&lt;38912(1042)-39953(1)</code></pre>
</div></div>
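<div class="paragraph"><p>The final <em>Assembly(n)</em> summary lines are convenient to post-process.  The Python sketch below, which is not part of PASA, extracts the genome coordinates of each aligned segment and infers the intron between consecutive segments.  It assumes that each segment is written as <em>lend(pos)-rend(pos)</em> and that the numbers in parentheses are positions along the assembled sequence, which are ignored here; the function names are hypothetical.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
import re

# One aligned segment: genome_lend(pos)-genome_rend(pos).
SEGMENT = re.compile(r"(\d+)\((\d+)\)-(\d+)\((\d+)\)")


def assembly_segments(line):
    """Extract genome (lend, rend) pairs from an 'Assembly(n): ...' line."""
    return [(int(m.group(1)), int(m.group(3))) for m in SEGMENT.finditer(line)]


def introns(segments):
    """Infer intron spans as the gaps between consecutive segments."""
    return [(left[1] + 1, right[0] - 1)
            for left, right in zip(segments, segments[1:])]


line = ("Assembly(1): orient(a-/s-) align: "
        "38401(1461)-38715(1147)&gt;YY....XX&lt;38808(1146)-39953(1)")
segments = assembly_segments(line)
print(segments, introns(segments))
```
</code></pre>
</div></div>
<div class="paragraph"><p>Applied to both summary lines above, this makes the difference between the two assemblies explicit: they share the left boundary of the intron at 38715 but resume at 38808 and 38912, respectively.</p></div>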
</div>
</div>
<div class="sect1">
<h2 id="A_train">Extraction of ORFs from PASA assemblies (auto-annotation and/or reference ORFs for training gene predictors)</h2>
<div class="sectionbody">
<div class="paragraph"><p>The PASA alignment assemblies can be used to automatically extract protein-coding regions for automated transcript-based genome annotation and/or for generating a high-quality data set for training ab initio gene predictors (e.g. Augustus, SNAP, genemarkHMM, glimmerHMM, etc.).  Our <a href="http://transdecoder.sf.net">TransDecoder</a> software, bundled with PASA, is used to identify likely coding regions.</p></div>
<div class="paragraph"><p>After running PASA to assemble all transcript alignments as described <a href="#A_sc">above</a>, run the following from your PASA working directory.</p></div>
<div class="paragraph"><p>The example below is what I would run in the sample_data/ directory:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>% $PASAHOME/scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta ${pasadb}.assemblies.fasta --pasa_transcripts_gff3 ${pasadb}.pasa_assemblies.gff3</code></pre>
</div></div>
<div class="paragraph"><p>This should generate a series of files, described below:</p></div>
<div class="ulist"><ul>
<li>
<p>
<em>fasta.transdecoder.cds,.pep,.gff3,.bed</em>: likely coding regions found in the PASA assemblies, with coordinates based on the transcripts and not the genome.
</p>
</li>
<li>
<p>
<em>fasta.transdecoder.genome.bed,gff3</em>: coordinates of gene models based on the genome sequence.
</p>
</li>
</ul></div>
<div class="paragraph"><p>The <em>fasta.transdecoder.pep</em> file has headers like so:</p></div>
<div class="literalblock">
<div class="content">
<pre><code>&gt;asmbl_10|m.58 asmbl_10|g.58  ORF asmbl_10|g.58 asmbl_10|m.58 type:5prime_partial len:107 (+) asmbl_10:1-322(+)
&gt;asmbl_10|m.57 asmbl_10|g.57  ORF asmbl_10|g.57 asmbl_10|m.57 type:complete len:515 (+) asmbl_10:156-1700(+)
&gt;asmbl_100|m.113 asmbl_100|g.113  ORF asmbl_100|g.113 asmbl_100|m.113 type:complete len:208 (-) asmbl_100:205-828(-)
&gt;asmbl_103|m.138 asmbl_103|g.138  ORF asmbl_103|g.138 asmbl_103|m.138 type:5prime_partial len:365 (+) asmbl_103:1-1096(+)
&gt;asmbl_104|m.147 asmbl_104|g.147  ORF asmbl_104|g.147 asmbl_104|m.147 type:5prime_partial len:251 (+) asmbl_104:1-754(+)
&gt;asmbl_118|m.149 asmbl_118|g.149  ORF asmbl_118|g.149 asmbl_118|m.149 type:3prime_partial len:129 (+) asmbl_118:20-407(+)
&gt;asmbl_119|m.160 asmbl_119|g.160  ORF asmbl_119|g.160 asmbl_119|m.160 type:5prime_partial len:655 (+) asmbl_119:1-1965(+)
&gt;asmbl_12|m.24 asmbl_12|g.24  ORF asmbl_12|g.24 asmbl_12|m.24 type:complete len:334 (+) asmbl_12:107-1108(+)
&gt;asmbl_120|m.126 asmbl_120|g.126  ORF asmbl_120|g.126 asmbl_120|m.126 type:3prime_partial len:520 (+) asmbl_120:61-1620(+)
&gt;asmbl_121|m.136 asmbl_121|g.136  ORF asmbl_121|g.136 asmbl_121|m.136 type:3prime_partial len:189 (+) asmbl_121:803-1371(+)
&gt;asmbl_122|m.145 asmbl_122|g.145  ORF asmbl_122|g.145 asmbl_122|m.145 type:5prime_partial len:394 (+) asmbl_122:1-1184(+)</code></pre>
</div></div>
<div class="paragraph"><p>The accession corresponds to the PASA assembly. The type indicator can be any of the following: complete, 5prime_partial, 3prime_partial, or internal.  A 5prime_partial ORF is missing a start codon and extends to the very 5' end; a 3prime_partial ORF is missing a stop codon and extends to the very 3' end; an internal ORF is translated from the first to the last base pair of the sequence and is missing both a start and a stop codon.  The <em>complete</em> category is of greatest interest for ab initio gene finder training.  Typically, we would search these complete proteins against the non-redundant protein database at GenBank and identify those ORFs that have good database matches across most of their length.  Such entries can be confidently used for training, in addition to those particularly long ORFs that do not match known proteins and are sufficiently complex in sequence.</p></div>
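<div class="paragraph"><p>As a first pass before any database search, the complete ORFs can be pulled out of the <em>.pep</em> headers directly.  The Python sketch below is illustrative and not part of PASA or TransDecoder; it relies only on the <em>type:</em> and <em>len:</em> fields visible in the headers above.  The function name <em>complete_orfs</em> and the <em>min_len</em> cutoff of 300 are hypothetical choices, not recommended values.</p></div>
<div class="listingblock">
<div class="content">
<pre><code>
```python
import re

# Capture the ORF identifier, the type field, and the len field of a header.
HEADER = re.compile(r"^>(\S+).*\btype:(\S+) len:(\d+)")


def complete_orfs(headers, min_len=100):
    """Return (orf_id, length) for 'complete' ORFs with len of at least min_len."""
    selected = []
    for header in headers:
        match = HEADER.match(header)
        if match is None:
            continue
        orf_id, orf_type = match.group(1), match.group(2)
        length = int(match.group(3))
        if orf_type == "complete" and length >= min_len:
            selected.append((orf_id, length))
    return selected


# Three headers copied from the example listing.
sample = [
    ">asmbl_10|m.58 asmbl_10|g.58  ORF asmbl_10|g.58 asmbl_10|m.58 "
    "type:5prime_partial len:107 (+) asmbl_10:1-322(+)",
    ">asmbl_10|m.57 asmbl_10|g.57  ORF asmbl_10|g.57 asmbl_10|m.57 "
    "type:complete len:515 (+) asmbl_10:156-1700(+)",
    ">asmbl_100|m.113 asmbl_100|g.113  ORF asmbl_100|g.113 asmbl_100|m.113 "
    "type:complete len:208 (-) asmbl_100:205-828(-)",
]
print(complete_orfs(sample, min_len=300))
```
</code></pre>
</div></div>
<div class="paragraph"><p>A length filter like this is only a coarse proxy; the database-match criterion described above remains the stronger evidence for selecting training entries.</p></div>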
</div>
</div>
<div class="sect1">
<h2 id="A_reference">References</h2>
<div class="sectionbody">
<div class="paragraph"><p>The PASA software and its original application are described in:</p></div>
<div class="ulist"><ul>
<li>
<p>
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. <a href="http://nar.oupjournals.org/cgi/content/full/31/19/5654">Nucleic Acids Res, 31, 5654-5666</a>.
</p>
</li>
</ul></div>
<div class="paragraph"><p>The use of PASA to analyze polyadenylation signals is described in:</p></div>
<div class="ulist"><ul>
<li>
<p>
Loke JC, Stahlberg EA, Strenski DG, Haas BJ, Wood PC, Li QQ.  (2005) Compilation of mRNA polyadenylation signals in Arabidopsis revealed a new signal element and potential secondary structures.   <a href="http://www.plantphysiol.org/cgi/content/full/138/3/1457">Plant Physiol. 2005 Jul;138(3):1457-68. Epub 2005 Jun 17</a>
</p>
</li>
<li>
<p>
Shen Y, Ji G, Haas BJ, Wu X, Zheng J, Reese GJ, Li QQ.  (2008) Genome level analysis of rice mRNA 3'-end processing signals and alternative polyadenylation. <a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&amp;pubmedid=18411206">Nucleic Acids Res. 2008 May; 36(9): 3150–3161.</a>
</p>
</li>
</ul></div>
<div class="paragraph"><p>Enhancements to PASA that automate the identification and classification of alternative splicing variations are described here:</p></div>
<div class="ulist"><ul>
<li>
<p>
Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR (2006) Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis.  <a href="http://www.biomedcentral.com/1471-2164/7/327">BMC Genomics 2006, 7:327</a>
</p>
</li>
<li>
<p>
Haas, BJ. (2008) Analysis of Alternative Splicing in Plants with Bioinformatics Tools (book chapter in:  <a href="http://www.springer.com/life+sci/plant+sciences/book/978-3-540-76775-6">Nuclear pre-mRNA Processing in Plants</a>)
</p>
</li>
</ul></div>
<div class="paragraph"><p>The use of PASA along with EVidenceModeler in a complete eukaryotic genome annotation pipeline is described in:</p></div>
<div class="ulist"><ul>
<li>
<p>
Haas et al. (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. <a href="http://genomebiology.com/2008/9/1/R7">Genome Biology 2008, 9:R7. doi:10.1186/gb-2008-9-1-r7</a>.
</p>
</li>
</ul></div>
<div class="paragraph"><p>Earlier work incorporating RNA-Seq data into gene structure annotation improvements, using PASA and the Inchworm component of Trinity, is described in the following. (Note: the new PASA/Trinity process described above is considerably different in execution, but similar in principle. Manuscript in prep.)</p></div>
<div class="ulist"><ul>
<li>
<p>
Rhind, et al. (2011) Comparative Functional Genomics of the Fission Yeasts. <a href="http://www.ncbi.nlm.nih.gov/pubmed/21511999">Science. 2011 Apr 21</a>.
</p>
</li>
<li>
<p>
Haas, et al. (2011) Approaches to Fungal Genome Annotation <a href="http://www.ncbi.nlm.nih.gov/pubmed/22059117">Mycology. 2011 Oct 3;2(3):118-141.</a>
</p>
</li>
</ul></div>
</div>
</div>
<div class="sect1">
<h2 id="A_MailingLists">Mailing Lists</h2>
<div class="sectionbody">
<div class="ulist"><ul>
<li>
<p>
<a href="https://lists.sourceforge.net/lists/listinfo/pasa-announce">pasa-announce@lists.sourceforge.net</a> for announcements regarding new software releases and related notifications.
</p>
</li>
<li>
<p>
<a href="https://lists.sourceforge.net/lists/listinfo/pasa-help">pasa-help@lists.sourceforge.net</a> for questions and help from the PASA user community.
</p>
</li>
</ul></div>
</div>
</div>
</div>
<div id="footnotes"><hr /></div>
<div id="footer">
<div id="footer-text">
Last updated 2013-06-06 08:54:41 EDT
</div>
</div>
</body>
</html>
